6 - Clustering

Author

CDN team

Published

Last compiled on 6 June, 2024

Clustering

Introduction

This notebook picks up on the cell filtering from the QC notebook and circles back to some of the PCA content from notebook 4. Below, we cover essential Seurat functions as well as clustering metrics that give quantitative guidelines to how well the clustering process is performing.

Key Takeaways

  • one

  • two

  • three

Glossary

Resources

  1. Current best practices in single-cell RNA-seq analysis: a tutorial
  2. Bioconductor
  3. Single Cell Best Practices
  4. TKTK

Load Libraries and Data

Libraries

if (!requireNamespace("tidyverse", quietly = TRUE))
    install.packages('tidyverse')
if (!requireNamespace("Seurat", quietly = TRUE))
    install.packages('Seurat')
if (!requireNamespace("colorBlindness", quietly = TRUE))
    install.packages('colorBlindness')
if (!requireNamespace("RColorBrewer", quietly = TRUE))
    install.packages('RColorBrewer')
if (!requireNamespace("cluster", quietly = TRUE))
    install.packages('cluster')
suppressPackageStartupMessages({
  library(dplyr)
  library(Seurat)
  library(tidyverse)
  library(RColorBrewer)
  library(colorBlindness)
  library(DoubletFinder)
  library(cluster)
})
set.seed(687)

Load Data

# Load the Seurat object with doublet and batch information
se <- readRDS('../data/workshop-data-withQuality.rds')
se 
An object of class Seurat 
33234 features across 59572 samples within 1 assay 
Active assay: RNA (33234 features, 3000 variable features)
 3 layers present: counts, data, scale.data
 2 dimensional reductions calculated: pca, umap
# Set color palette
pal <- paletteMartin
names(pal) <- sort(unique(se$Celltype))

We left off (last session) filtering out the poor quality data without regard for its distribution. Last week, we learned how principle components are calculated, what a latent space is, and defined a kNN (k-nearest neighbors) graph.

To review those:

We take a cell x gene matrix, normalize it, and reduce its dimensionality using PCA. We looked at an Elbow plot to determine the best number of PCs that accurately show the variance explained. After selecting the top 30 PCs, we generated a k-nearest neighbors (kNN) graph to represent the relationships between cells based on their gene expression profiles and ran FindClusters() to identify clusters of cells based on their neighborhood relationships. Kyle introduced Harmony as an integration technique to correct for batch effects and I just went over QC metrics including doublet detection that help us understand the quality and content of our data.

Today, we’re going to: 1. Look at the FindClusters() function, its default parameters, and the algorithm(s) it relies on. 2. Plot a UMAP

Overview

PCA - A dimensionality reduction technique that reduces the number of dimensions in a dataset while retaining as much variance as possible. The first principal component accounts for the most variance in the data, and each subsequent component accounts for less variance.

Latent Space - The latent space is the low-dimensional representation of the data that captures the most important features.

kNN Graph - A graph that represents the relationships between cells based on their gene expression profiles. Each cell is connected to its k (1, 20, 100) nearest neighbors in the graph.

Preprocessing

UMAP of Counts and QC Metrics

Clustering

Before clustering, these are the preprocessing steps we took:

# se %>%
#     NormalizeData(verbose = FALSE) %>%
#     FindVariableFeatures(
#         method = "vst",
#         nfeatures = 3000,
#         verbose = FALSE) %>%
#     ScaleData(verbose = FALSE) %>% # Scales data to have mean 0 and variance 1
#     RunPCA(verbose = FALSE) 

Plots:

  • PCA = scree plot or biplot

  • cluster visualization or cluster heatmaps

PCA

Let’s start by visualizing PC1 and PC2 by Celltype.

# DimPlot allows us to color the latent space based on categorical values per cell, like cell type or which sample the cell is from
celltype_pca <- DimPlot(
        se,
        reduction = "pca",
        group.by = 'Celltype'
        ) 

celltype_pca

Construct the kNN graph:

FindNeighbors() is the Seurat function that calculates the K-nearest neighbors of each cell in PCA space. The number of neighbors used for this calculation is controlled by the k.param parameter in the FindNeighbors function. The default value is 30. The function also calculates the distance between each cell and its neighbors. The function computes pairwaise distances between cells based on their gene expression common ex: euclidean distance, manhattan distance, Pearson correlation distance, cosine similarity. choice matters.

From BioStars FindNeighbors() is a function that is used to find the nearest neighbors of your single cell data point within a dataset. It works by calculating the neighborhood overlap (Jaccard index) between every cell and its k. param nearest neighbors. It’s often employed in various applications such as anomaly detection, and dimensionality reduction. The concept is that given a data point, you want to identify the closest data points to it based on some similarity metric, such as Euclidean distance or cosine similarity. This helps to identify similar points in the dataset, which can be useful for making predictions or understanding the distribution of the data.*

FindNeighbors()

The default values of FindNeighbors() are:

FindNeighbors( object, reduction = "pca", dims = 1:10, assay = NULL, features = NULL, k.param = 20, return.neighbor = FALSE, compute.SNN = !return.neighbor, prune.SNN = 1/15, nn.method = "annoy", n.trees = 50, annoy.metric = "euclidean", nn.eps = 0, verbose = TRUE, do.plot = FALSE, graph.name = NULL, l2.norm = FALSE, cache.index = FALSE, ... )

Let’s modify these to fit our analysis:

se_1 <- se %>% 
    FindNeighbors( 
      object = se,
      reduction = "pca",
      dims = 1:30,
      k.param = 30,
      verbose = FALSE
    )
se_1@graphs
$RNA_nn
A Graph object containing 59572 cells
$RNA_snn
A Graph object containing 59572 cells

TKTK: is there any reason or way to look at the neighbors computed?

FindClusters

From BioStars: FindClusters() is a function used for clustering data points into groups or clusters based on their similarity. It uses a graph-based clustering approach and a Louvain algorithm. Clustering is an unsupervised learning technique where the algorithm groups similar cells together without any predefined labels. The goal is to find patterns and structure in your data. The number of clusters and the algorithm used can vary based on the problem and data characteristics. Common clustering algorithms include K-means, hierarchical clustering, and DBSCAN.

FindClusters() is used for identifying clusters of cells based on their neighborhood relationships typically obtained from PCA or t-SNE. It inputs the graph made from FindNeighbors and. outputs cluster assignments for each cell found at se@meta.data$seurat_clusters.

There are a couple of popular clustering algorithms. There is no one way to cluster as clustering is a means of looking at the data from different angles. The most popular clustering algortihms are the louvain algorithm, leiden algorithm, hierarchical clustering, and k-means clustering. Seruat uses the Leiden algorithm by default which is an improvement on the Louvain algorithm.

The resolution parameter controls the granularity of the clustering. Higher values of resolution will result in more clusters, while lower values will result in fewer clusters. The default value is 0.8.

In clustering, the goal is not to see how many clusters you can pull apart but it is an iterative process. especially in the first pass, you want to pull apart main cell groups such as epithelial cells and immune cells so you can further refine the clustering to get more granular cell types in the next iteration.

After clustering, we’ll review some cluster validation techniques to qualitatively and quantitatively check the quality of the clustering results.

FindClusters

The default values for FindClusters() are: FindClusters( object, graph.name = NULL, cluster.name = NULL, modularity.fxn = 1, initial.membership = NULL, node.sizes = NULL, resolution = 0.8, method = “matrix”, algorithm = 1, n.start = 10, n.iter = 10, random.seed = 0, group.singletons = TRUE, temp.file.location = NULL, edge.file.name = NULL, verbose = TRUE, … ) where when ‘algorithm’ = 1 original Louvain algorithm = 2 Louvain algorithm with multilevel refinement = 3 SLM (Smart Local Moving) algorithm = 4 Leiden algorithm See ?FindClusters for more information

We modify those to fit our object:

se_1 <- FindClusters(
      object = se_1,
      resolution = c(0.01, 0.05, 0.1, 0.15, 0.2,0.25)) 
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 59572
Number of edges: 3484220

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9956
Number of communities: 5
Elapsed time: 19 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 59572
Number of edges: 3484220

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9852
Number of communities: 10
Elapsed time: 20 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 59572
Number of edges: 3484220

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9763
Number of communities: 13
Elapsed time: 21 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 59572
Number of edges: 3484220

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9700
Number of communities: 15
Elapsed time: 24 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 59572
Number of edges: 3484220

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9654
Number of communities: 17
Elapsed time: 22 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 59572
Number of edges: 3484220

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9613
Number of communities: 20
Elapsed time: 24 seconds
# Find clusters based on nearest neighbors at a resolution of 0.5
# Results seen at RNA_snn_res in metadata
se_1@meta.data %>% select(contains("RNA_snn_res")) %>% colnames()
[1] "RNA_snn_res.0.01" "RNA_snn_res.0.05" "RNA_snn_res.0.1"  "RNA_snn_res.0.15" "RNA_snn_res.0.2"  "RNA_snn_res.0.25"

Louvain vs Leiden

TKTK Define “community detection algorithm”. Shared nearest neighbors, why we use sNN. Define unsupervised clustering. we’re computing this over the dimensionality reduction space that we chose.

The Louvian algorithm was developed in 2008 and is a most popular community detection algorithm widely used in scRNA-seq. It recursively merges communities into a single node and executes the modularity clustering on the condensed graphs.(Zhang) Both Seurat and scanpy use Louvain as the default clustering algorithm.


Leiden Algorithm

The Leiden algorithm was published in 2020 as an improvement of the Louvain algorithm. Leiden creates clusters by taking into account the number of links between cells in a cluster versus the overall expected number of links in the dataset. It computes a clustering on a KNN graph obtained from the PC reduced expression space. It starts with an initial partition where each node from its own community. Next, the algorithm moves single nodes from one community to another to find a partition, which is then refined. Based on a refined partition an aggregate network is generated, which is again refined until no further improvements can be obtained, and the final partition is reached. (Single Cell Best Practices)

RunUMAP

Show that one weird figure figure as it goes from one resolution to the next. Show that hierarchal clustering takes 2 groups and splits each group into more rgoups. but with graph based, a higher resolution splits the whole dataset into more groups. k-means is greedy algo splits into equal size clusters. Look for proliferating cells. Look at MS4A1 for proliferating cells. For choosing 0.05 as the resolution, compare to 0.1 and note the big blobs splitting therefore keep them together for first round of clustering and get into the weeds in level 2 annotation.

  • This function runs the UMAP algorithm on the PCA space. The dims parameter specifies which dimensions of the PCA space to use for the UMAP calculation. The default value is 1:30, which uses all dimensions. The n.components parameter specifies the number of dimensions in the UMAP embedding. The default value is 2.
se_1 <- se_1 %>% 
    RunUMAP(dims = 1:30, verbose = FALSE) # Run UMAP 

Visualize Clusters

DimPlot(
    se_1,
    group.by = c(
        "RNA_snn_res.0.01", "RNA_snn_res.0.05",
        "RNA_snn_res.0.1", "RNA_snn_res.0.15",
        "RNA_snn_res.0.2", "RNA_snn_res.0.25"),
    label = TRUE)

Different Resolutions

0.01 vs 0.05 Resolution

DimPlot(
    se_1,
    group.by = c(
        "Celltype","RNA_snn_res.0.05", "RNA_snn_res.0.1"),
    label = TRUE)

Let’s go with 0.05 and perform cluster validation.

Cluster Metrics - most important part.

Cluster Diversity

1 - Number of Cells in each Cluster

seurat_clusters <- "RNA_snn_res.0.05"
diversityPerGroup <- function(df, species, group, diversity_metric = 'simpson') {
  # Convert strings to symbols for curly-curly operator
  species_sym <- rlang::sym(species)
  group_sym <- rlang::sym(group)
  # Count groups per species directly using curly-curly
  tierNcount <- df %>%
    group_by({{species_sym}}) %>%
    count({{group_sym}}, name = "n") %>% ungroup
  # Pivot table to allow vegan::diversity call
  tierNwide <- tierNcount %>%
    pivot_wider(names_from = {{group_sym}}, values_from = n, values_fill = list(n = 0))
  # Use rownames of species for the diversity function, which needs a dataframe
  tierNwide_df <- as.data.frame(tierNwide)
  # species column must be first
  tierNwide_df <- tierNwide_df %>% select({{species}}, everything())
  rownames(tierNwide_df) <- tierNwide_df[, 1]
  tierNwide_df <- tierNwide_df[, -1]
  # Calculate diversity
  diversity_values <- vegan::diversity(tierNwide_df, index = diversity_metric)
  # Prepare result as a dataframe
  result <- data.frame(
    species = rownames(tierNwide_df),
    diversity = diversity_values,
    row.names = NULL
  )
  # Rename diversity column
  names(result)[1] <- species
  names(result)[2] <- sprintf('%s_diversity', group)
  return(result)
}

# Calculate simpson's diversity per cluster
clusterMetrics <- diversityPerGroup(se_1@meta.data,
                        species = 'RNA_snn_res.0.05',
                        group = 'Sample ID',
                        diversity_metric = 'simpson')

# Calculate number of cells per cluster and join to metrics table
clusterMetrics <- clusterMetrics %>% left_join(se_1@meta.data %>% count(RNA_snn_res.0.05))

# clusterMetrics

# p1 <- ggplot(clusterMetrics, aes(x = Celltype, y = n)) +
#   geom_bar(stat = "identity", fill = 'black') +
#   scale_y_log10() +
#   labs(x = "Cell Type", y = "Cell Number (log scale)") +
#   theme_minimal() +
#   theme(axis.text.x = element_text(angle = 45, hjust = 1))
library(viridis)
seurat_clusters <- "RNA_snn_res.0.05"
clusterMetrics$seurat_clusters <- as.numeric(clusterMetrics$RNA_snn_res.0.05)

lollipop <- ggplot(clusterMetrics, aes(x = seurat_clusters, y = n)) +
  geom_segment(aes(x = seurat_clusters, xend = seurat_clusters, y = 0, yend = n),
               size = 1.5, color = 'grey80') + # Draw lines for lollipops
  geom_point(aes(colour = `Sample ID_diversity`), size = 5) + # Add colored lollipop 'heads', coloring by 'Sample ID_diversity'
  scale_y_log10() +
  scale_x_continuous(breaks = seq(0,20)) + 
  scale_colour_viridis(option = "C", name = "Sample ID Diversity", direction = 1, limits = c(0,1)) + # Colorblind-friendly, vibrant color palette
  theme_minimal(base_size = 10) +
  theme(legend.position = "bottom",
        axis.text = element_text(size = 12), 
        axis.title = element_text(size = 14), 
        title = element_text(size = 16)) +
  labs(x = "Seurat Clusters",
       y = "cluster size (log-scaled)",
       title = "Cluster Diversity Metrics") +
  guides(colour = guide_colourbar(title.position = "top", title.hjust = 0.5))

lollipop

2 - Which samples are in each cluster

# Plot sample distribution per cluster

# Load necessary packages
library(ggplot2)
library(dplyr)
library(tidyr)

# Assuming se_1@meta.data has columns 'seurat_clusters' and 'Sample ID'

# Prepare the data for plotting
plot_sample <- se_1@meta.data %>%
  count(RNA_snn_res.0.05, `Sample ID`) %>%
  group_by(RNA_snn_res.0.05) %>%
  mutate(proportion = n / sum(n)) %>%
  ungroup()

# Create a stacked bar plot
ggplot(plot_sample, aes(x = factor(RNA_snn_res.0.05), y = proportion, fill = `Sample ID`)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(x = "Seurat Clusters", y = "Proportion", fill = "Sample ID",
       title = "Distribution of Sample IDs Across Clusters") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_manual(values = pal)

# TD: use Marc's color scheme for donors- factor analysis
# Prepare the data for plotting
plot_batch <- se_1@meta.data %>%
  count(RNA_snn_res.0.05, batch) %>%
  group_by(RNA_snn_res.0.05) %>%
  mutate(proportion = n / sum(n)) %>%
  ungroup()

# Create a stacked bar plot
ggplot(plot_batch, aes(x = factor(RNA_snn_res.0.05), y = proportion, fill = batch)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(x = "Seurat Clusters", y = "Proportion", fill = "Batch",
       title = "Distribution of Batches Across Clusters") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

3 - Stacked Bar Plot of Cluster Diversity

p2 <- ggplot(clusterMetrics, aes(y=as.character(seurat_clusters), fill=`Sample ID_diversity`, x = 1, label = n)) +
  geom_tile(colour = "white") +
  geom_text(nudge_x = 1.5, size = 3) +
  geom_text(aes(label = signif(`Sample ID_diversity`, 2)),size = 3) +
  scale_fill_distiller(palette = "Spectral", limits = c(0,1)) + theme_classic() +
  coord_fixed(ratio = 0.25) + 
  expand_limits(x = c(0.5,3)) +
  labs(x = "Diversity            Size") +
  theme(axis.text.y = element_text(hjust = 1, vjust = 0.5, size = 12),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.y = element_text(size = 15),
        legend.key.size = unit(1, 'cm'),
        legend.title = element_text(size=10), 
        legend.text = element_text(size=10)
  )
p2

Silhouette Analysis

As covered in the last workshop, silhouette analysis is a way to measure how similar an object is to its own cluster compared to other clusters. The silhouette value ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.

seurat_clusters <- se_1@meta.data$RNA_snn_res.0.05

pca_embeddings <- Embeddings(se, reduction = 'pca')

# Calculate silhouette widths
sil_scores <- silhouette(x = as.numeric(seurat_clusters), dist = dist(pca_embeddings))

# Extract silhouette scores
silhouette_data <- as.data.frame(sil_scores)
# Recover cell type names
silhouette_data$seurat_clusters <- as.character(seurat_clusters)

row.names(silhouette_data) <- row.names(pca_embeddings)

silhouette_arranged <- silhouette_data %>% 
  group_by(seurat_clusters) %>% 
  arrange(-sil_width)
overall_average <- silhouette_arranged %>% 
  ungroup %>% 
  summarize(ave = as.numeric(mean(sil_width))) %>% 
  pull(ave)

full_plot <- ggplot(silhouette_arranged, 
                    aes(x = sil_width, 
                        y = seurat_clusters, 
                        fill = seurat_clusters, 
                        group = seurat_clusters)) +
    geom_bar(stat = "identity", position = 'dodge2') +
    geom_vline(xintercept = overall_average,
               color = 'red',
               linetype = 'longdash') +
    theme_minimal() +
    labs(title = "Silhouette Analysis",
        y = "Cluster",
        x = "Silhouette width",
        fill = "Cluster") +
    theme(axis.text.y = element_text(hjust = 1, vjust = 0.5, size = 20),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.y = element_text(size = 20),
        legend.position = "None")

full_plot

Look at silhouette scores on a UMAP

d5 <- DimPlot(se_1,
        reduction='umap',
        group.by='RNA_snn_res.0.05')

se_1$CellID <- row.names(se_1@meta.data)

sil_ids <- silhouette_data %>% rownames_to_column('CellID') %>% left_join(se_1@meta.data, by='CellID')
head(silhouette_data)
                    cluster neighbor sil_width seurat_clusters
AAACCCAAGAATGTTG-12       1        3 0.4380270               0
AAACCCAAGCATTGTC-19       1        3 0.3523827               0
AAACCCAAGCTACGTT-6        1        6 0.3017238               0
AAACCCAAGGCCGCTT-3        5        7 0.2647330               4
AAACCCAAGGCCTGCT-12       1        3 0.2760281               0
AAACCCAAGGGCAATC-1        3        1 0.4495139               2
head(se_1@meta.data)
                       orig.ident nCount_RNA nFeature_RNA Sample ID    Disease group                                                                                                                        Comorbidity   Hospital day WBC per microL Neutrophil per microL (%) Lymphocyte per microL (%) Monocyte prt microL (%) C-reactive protein (mg per dL)                                                Chest X-ray                                               Treatment Respiratory rate (BPM)  O2 saturation  O2 supplement    Temperature    Systolic BP Heart rate (BPM)  Consciousness     NEWS score       Severity           Celltype Number of UMI Number of Gene Percentage of mitochondrial gene Age tissue_ontology_term_id assay_ontology_term_id disease_ontology_term_id cell_type_ontology_term_id self_reported_ethnicity_ontology_term_id development_stage_ontology_term_id sex_ontology_term_id is_primary_data organism_ontology_term_id donor_id suspension_type tissue_type                                cell_type     assay   disease     organism    sex tissue self_reported_ethnicity       development_stage observation_joinid  perc.mt perc.ribo    perc.hb Experimental batch batch       pANN
AAACCCAAGAATGTTG-12 SeuratProject       5322         1910    nCoV 6    mild COVID-19                                                                                                    diabetes mellitus, dyslipidemia              3           5590               3119 (55.8)               1834 (32.8)              637 (11.4)                           0.59                                                  pneumonia             lopinavir and ritonavir, hydroxychloroquine                     18             97       Room air           37.3             92               80          Alert              2           mild            NK cell          5322           1910                         3.100338  38          UBERON:0000178            EFO:0009922            MONDO:0100096                 CL:0000623                                  unknown                     HsapDv:0000132         PATO:0000383            TRUE            NCBITaxon:9606   nCoV 6            cell      tissue                      natural killer cell 10x 3' v3  COVID-19 Homo sapiens female  blood                 unknown 38-year-old human stage         1s~)u&E!VB 3.100338  23.22435 0.01878993                  3     3 0.21890415
AAACCCAAGCATTGTC-19 SeuratProject       7644         2292  Normal 4    Healthy Donor                                                                                                                     not applicable not applicable not applicable            not applicable            not applicable          not applicable                 not applicable                                             not applicable                                          not applicable         not applicable not applicable not applicable not applicable not applicable   not applicable not applicable not applicable not applicable       CD4, EM-like          7646           2294                         7.847240  63          UBERON:0000178            EFO:0009922             PATO:0000461                 CL:0001044                                  unknown                     HsapDv:0000157         PATO:0000384            TRUE            NCBITaxon:9606 Normal 4            cell      tissue effector CD4-positive, alpha-beta T cell 10x 3' v3    normal Homo sapiens   male  blood                 unknown 63-year-old human stage         T<AIMsSxFw 7.849294  29.72266 0.02616431                  4     4 0.05454545
AAACCCAAGCTACGTT-6  SeuratProject       5406         1945     Flu 3 severe influenza kidney transplantation status due to ESRD, atrial fibrillation, stroke, hypertension, diabetes, history of  pulmonary tuberculosis            nan          15400              12813 (83.2)                 755 (4.9)             1725 (11.2)                           6.65 total collapse LLL, pleural effusion, pericardial effusion                                                     nan                    nan not applicable not applicable not applicable not applicable   not applicable not applicable not applicable not applicable   CD8, non-EM-like          5409           1948                         9.521168  70          UBERON:0000178            EFO:0009922            MONDO:0005812                 CL:0000625                                  unknown                     HsapDv:0000164         PATO:0000384            TRUE            NCBITaxon:9606    Flu 3            cell      tissue          CD8-positive, alpha-beta T cell 10x 3' v3 influenza Homo sapiens   male  blood                 unknown 70-year-old human stage         u8rl}V-`Kn 9.526452  19.18239 0.00000000                  2     2 0.03508772
AAACCCAAGGCCGCTT-3  SeuratProject       3981         1583     Flu 1 severe influenza                                                                                                 hypertension, rheumatoid arthritis            nan          12900              10655 (82.6)                1174 (9.1)              1032 (8.0)                           9.39      peribronchial opacity in both lungs, pleural effusion                                                     nan                    nan not applicable not applicable not applicable not applicable   not applicable not applicable not applicable not applicable classical Monocyte          3982           1584                         6.428930  68          UBERON:0000178            EFO:0009922            MONDO:0005812                 CL:0000860                                  unknown                     HsapDv:0000162         PATO:0000384            TRUE            NCBITaxon:9606    Flu 1            cell      tissue                       classical monocyte 10x 3' v3 influenza Homo sapiens   male  blood                 unknown 68-year-old human stage         _)<TM?x-7b 6.430545  11.93168 0.07535795                  1     1 0.16921606
AAACCCAAGGCCTGCT-12 SeuratProject       3360         1537    nCoV 6    mild COVID-19                                                                                                    diabetes mellitus, dyslipidemia              3           5590               3119 (55.8)               1834 (32.8)              637 (11.4)                           0.59                                                  pneumonia             lopinavir and ritonavir, hydroxychloroquine                     18             97       Room air           37.3             92               80          Alert              2           mild            NK cell          3361           1538                         6.486165  38          UBERON:0000178            EFO:0009922            MONDO:0100096                 CL:0000623                                  unknown                     HsapDv:0000132         PATO:0000383            TRUE            NCBITaxon:9606   nCoV 6            cell      tissue                      natural killer cell 10x 3' v3  COVID-19 Homo sapiens female  blood                 unknown 38-year-old human stage         V$~fT4_K{L 6.488095  12.79762 0.00000000                  3     3 0.20957206
AAACCCAAGGGCAATC-1  SeuratProject      15674         3308    nCoV 1  severe COVID-19                                                                                                                     no comorbidity             16          21540              19235 (89.3)                1055 (4.9)              1271 (5.9)                           7.58                                                  pneumonia lopinavir and ritonavir, hydroxychloroquine, nafamostat                     24             90      ECMO + MV           37.6             92              122   Unresponsive             14         severe       B cell, IgG+         15679           3311                         7.296384  63          UBERON:0000178            EFO:0009922            MONDO:0100096                 CL:0000979                                  unknown                     HsapDv:0000157         PATO:0000384            TRUE            NCBITaxon:9606   nCoV 1            cell      tissue                        IgG memory B cell 10x 3' v3  COVID-19 Homo sapiens   male  blood                 unknown 63-year-old human stage         W+utHdOui6 7.298711  38.08855 0.01913998                  1     1 0.29302103
                    DF_class doublet_run    pK   pN      quality RNA_snn_res.0.01 RNA_snn_res.0.05 RNA_snn_res.0.1 RNA_snn_res.0.15 RNA_snn_res.0.2 RNA_snn_res.0.25 seurat_clusters              CellID
AAACCCAAGAATGTTG-12  Singlet        4142   0.3 0.25 good quality                0                0               2                1               0                0               0 AAACCCAAGAATGTTG-12
AAACCCAAGCATTGTC-19  Singlet        4185 0.005 0.25 good quality                0                0               0                4               4                4               4 AAACCCAAGCATTGTC-19
AAACCCAAGCTACGTT-6   Singlet        4120 0.005 0.25 good quality                0                0               0                2               1                1               1  AAACCCAAGCTACGTT-6
AAACCCAAGGCCGCTT-3   Singlet        4180   0.1 0.25 good quality                4                4               5                6               7                6               6  AAACCCAAGGCCGCTT-3
AAACCCAAGGCCTGCT-12  Singlet        4142   0.3 0.25 good quality                0                0               2                1               0                0               0 AAACCCAAGGCCTGCT-12
AAACCCAAGGGCAATC-1   Doublet        4180   0.1 0.25 good quality                2                2               3                3               3                3               3  AAACCCAAGGGCAATC-1
se_1 <- AddMetaData(se_1, sil_ids)

FeaturePlot(se_1, feature = "sil_width") + ggtitle('Silhouette width') + scale_color_viridis_c(limits = c(-1,1), option = "magma") | d5

# TD: change color palette
library(ggplot2)

# Convert the 'batch' variable to a character
se@meta.data$batch <- as.character(se@meta.data$batch)
d1 <- DimPlot(se,
        reduction='umap',
        group.by='batch')

d2 <- DimPlot(se,
        reduction='umap',
        group.by='DF_class')

d3 <- DimPlot(se,
        reduction='umap',
        group.by='Sample ID')

d4 <- DimPlot(se,
        reduction='umap',
        group.by='Celltype')

d1 | d2

d3 | d4

Session Info

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] viridis_0.6.5        viridisLite_0.4.2    cluster_2.1.6        DoubletFinder_2.0.4  colorBlindness_0.1.9 RColorBrewer_1.1-3   lubridate_1.9.3      forcats_1.0.0        stringr_1.5.1        purrr_1.0.2          readr_2.1.5          tidyr_1.3.1          tibble_3.2.1         ggplot2_3.5.1        tidyverse_2.0.0      Seurat_5.0.3         SeuratObject_5.0.2   sp_2.1-4             dplyr_1.1.4         

loaded via a namespace (and not attached):
  [1] rstudioapi_0.16.0      jsonlite_1.8.8         magrittr_2.0.3         spatstat.utils_3.0-4   farver_2.1.2           rmarkdown_2.27         vctrs_0.6.5            ROCR_1.0-11            spatstat.explore_3.2-7 htmltools_0.5.8.1      gridGraphics_0.5-1     sctransform_0.4.1      parallelly_1.37.1      KernSmooth_2.23-22     htmlwidgets_1.6.4      ica_1.0-3              plyr_1.8.9             plotly_4.10.4          zoo_1.8-12             igraph_2.0.3           mime_0.12              lifecycle_1.0.4        pkgconfig_2.0.3        Matrix_1.6-5           R6_2.5.1               fastmap_1.2.0          fitdistrplus_1.1-11    future_1.33.2          shiny_1.8.1.1          digest_0.6.35          colorspace_2.1-0       patchwork_1.2.0        tensor_1.5             RSpectra_0.16-1        irlba_2.3.5.1          vegan_2.6-6            labeling_0.4.3         progressr_0.14.0       fansi_1.0.6            spatstat.sparse_3.0-3  timechange_0.3.0       mgcv_1.9-1             httr_1.4.7             polyclip_1.10-6        abind_1.4-5            compiler_4.3.2         withr_3.0.0            fastDummies_1.7.3      MASS_7.3-60.0.1        permute_0.9-7          tools_4.3.2           
 [52] lmtest_0.9-40          httpuv_1.6.15          future.apply_1.11.2    goftest_1.2-3          glue_1.7.0             nlme_3.1-164           promises_1.3.0         grid_4.3.2             Rtsne_0.17             reshape2_1.4.4         generics_0.1.3         gtable_0.3.5           spatstat.data_3.0-4    tzdb_0.4.0             data.table_1.15.4      hms_1.1.3              utf8_1.2.4             spatstat.geom_3.2-9    RcppAnnoy_0.0.22       ggrepel_0.9.5          RANN_2.6.1             pillar_1.9.0           spam_2.10-0            RcppHNSW_0.6.0         later_1.3.2            splines_4.3.2          lattice_0.22-6         survival_3.6-4         deldir_2.0-4           tidyselect_1.2.1       miniUI_0.1.1.1         pbapply_1.7-2          knitr_1.45             gridExtra_2.3          scattermore_1.2        xfun_0.44              matrixStats_1.3.0      stringi_1.8.4          lazyeval_0.2.2         yaml_2.3.8             evaluate_0.23          codetools_0.2-20       cli_3.6.2              uwot_0.1.16            xtable_1.8-4           reticulate_1.36.1      munsell_0.5.1          Rcpp_1.0.12            globals_0.16.3         spatstat.random_3.2-3  png_0.1-8             
[103] parallel_4.3.2         dotCall64_1.1-1        listenv_0.9.1          scales_1.3.0           ggridges_0.5.6         leiden_0.4.3.1         rlang_1.1.3            cowplot_1.1.3